Capstone Project¶

Wine Quality Prediction¶

Introduction¶

Wine is a beverage made from fermented grape and other fruit juices with lower amount of alcohol content. Quality of wine is graded based on the taste of wine and vintage. This process is time taking, costly and not efficient. A wine itself includes different parameters like fixed acidity, volatile acidity, citric acid, residual sugar chlorides, free sulphur dioxide, total sulphur dioxide, density, pH, sulphates, alcohol and quality.

Problem statement¶

In industries, understanding the demands of wine safety testing can be a complex task for the laboratory th numerous analytes and residues to monitor But, our applications prediction, provide ideal solutions for the analysis of wine, which will make this whole process efficient and cheaper with less human interaction.

Objective¶

Our main objective is to predict the wine quality using machine learning through Python programming language

A large dataset is considered and wine quality modelled to analyse the quality of wine through different parameters like fixed acidity, volatile acidity etc. All these parameters will be analyse through Machine Leaming algorithms like random forest classifier algorithm which will helps to rate wine on scale 1-10 or bad-good. Output obtained would further be checked for correctness and model will be optimized accordingly.

It can support the wine expert evaluation and ultimately improve the production.

About Data¶

Citation Request: This dataset is public available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

  1. Title: Wine Quality

  2. Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

  3. Past Usage:

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

    In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

  4. Relevant Information:

    The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

    These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are munch more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.

  5. Number of Instances: red wine - 1599; white wine - 4898.

  6. Number of Attributes: 11 + output attribute

    Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.

  7. Attribute information:

    For more information, read [Cortez et al., 2009].

    Input variables (based on physicochemical tests): 1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol Output variable (based on sensory data): 12 - quality (score between 0 and 10)

  8. Missing Attribute Values: None

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

import time

sns.set()
%matplotlib inline
In [2]:
# Read data set 
data = pd.read_csv('QualityPrediction.csv')
data.head()
Out[2]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

We have 11 features and one target.

Lets understand more about the data.

In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

We have total 1599 records with non Null and numeric 11 features and one target variable 'qulaity'

In [4]:
data.describe()
Out[4]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000

Here its very clear that we have all numeric fields and none of them is having any missing values. max and mean value of the total sulfer dioxide is too high than other feature values so reuired to do feature scalling.

Exploratory Data Analysis¶

In [5]:
# Import pandas_profiling to get Initial EDA.
import pandas_profiling as pp
In [6]:
pp.ProfileReport(data)
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
Out[6]:

As per above EDA we have 220 rows (13.8% of total data set) as duplicate rows. Duplicate rows will add unnecessary noise into our data so this will weaken our model, So we can remove these duplicates.

In [7]:
# Remove duplicate records and keep one record each.
data.drop_duplicates(keep='first')
Out[7]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5
1 7.8 0.880 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 5
2 7.8 0.760 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 5
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 6
5 7.4 0.660 0.00 1.8 0.075 13.0 40.0 0.99780 3.51 0.56 9.4 5
... ... ... ... ... ... ... ... ... ... ... ... ...
1593 6.8 0.620 0.08 1.9 0.068 28.0 38.0 0.99651 3.42 0.82 9.5 6
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.5 5
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 6
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 5
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 6

1359 rows × 12 columns

In [8]:
# get target values count
data.groupby('quality')['quality'].count()
Out[8]:
quality
3     10
4     53
5    681
6    638
7    199
8     18
Name: quality, dtype: int64
In [9]:
# create a barplot to compare the counts of the quality 
sns.countplot(data['quality'])
plt.title('Original Distribution of the Quality')
Out[9]:
Text(0.5, 1.0, 'Original Distribution of the Quality')

Each expert graded the wine quality between 0 (very bad) to 10 (very excellent). We can split quality into two categories.

  1. Bad Quality (0-5 grade)
  2. Good Quality (6-10 grade)

So let do it. Bad Quality (0) Good Quality (1)

In [10]:
# Mapping quality as 0 or 1
data['quality_cat'] = [1 if x > 5 else 0 for x in data.quality]
data.head()
Out[10]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality quality_cat
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 0
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5 0
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5 0
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6 1
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 0
In [11]:
# create bar plot for new category with (0 and 1)
sns.countplot(data['quality_cat'])
plt.title('New Distribution of the Quality')
Out[11]:
Text(0.5, 1.0, 'New Distribution of the Quality')

Quality_cat is looks like balanced data. Which is one of important requirement while dealing with classification problems.

Correlation¶

During EDA we found some correlation between features. Lets see correlation using heatmap

In [12]:
# Plot heatmap to see if any features are highly correlated so we can remove it.
plt.figure(figsize=(15,10))
cor_plot = sns.heatmap(data.corr(), cmap="YlGnBu", annot=True, linewidth=.5)
plt.show()

Highest Correlation between features is 0.67 and -0.68. This correlation between 'fixed acidity' and 'pH','density', 'citric acid', so we can remove feature 'fixed acidity'. Another highest correlation 0.67 is between 'free sulfur dioxide' and 'total sulfur dioxide' so we can remove one of the feature

In [13]:
# Removed highly correlated feature 'fixed acidity'
data.drop(['fixed acidity'], axis = 1, inplace= True)

# Removed highly correlated feature 'free sulfur dioxide'
data.drop(['free sulfur dioxide'], axis = 1, inplace= True)
data.head()
Out[13]:
volatile acidity citric acid residual sugar chlorides total sulfur dioxide density pH sulphates alcohol quality quality_cat
0 0.70 0.00 1.9 0.076 34.0 0.9978 3.51 0.56 9.4 5 0
1 0.88 0.00 2.6 0.098 67.0 0.9968 3.20 0.68 9.8 5 0
2 0.76 0.04 2.3 0.092 54.0 0.9970 3.26 0.65 9.8 5 0
3 0.28 0.56 1.9 0.075 60.0 0.9980 3.16 0.58 9.8 6 1
4 0.70 0.00 1.9 0.076 34.0 0.9978 3.51 0.56 9.4 5 0
In [14]:
# Correlation after removing feature 'fixed acidity'
plt.figure(figsize=(15,10))
cor_plot = sns.heatmap(data.corr(), cmap="YlGnBu", annot=True, linewidth=.5)
plt.show()

Now we are not having any highly correlated features.

Model Building¶

In [15]:
# lets split the data into train and test dataset 
features = data.iloc[:,:-2]
target = data['quality_cat']
Xtrain, Xtest, Ytrain, Ytest = train_test_split (features, target, test_size = 0.2, random_state = 10, shuffle =True )

Scaling¶

Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms use Eucledian distance between two data points in their computations, this is a problem. If one feature is having lower values and other is having much higher values so the feature having higher values will dominate the model predictions so we need to scale all the features in one scale.

In [16]:
# Scaling the data 
# Using StandardScaler 
from sklearn.preprocessing import StandardScaler
SS = StandardScaler()
Xtrain_scaled = SS.fit_transform(Xtrain)
Xtest_scaled = SS.transform(Xtest)

Now we are ready with our train and test data lets import a classification algorithm and start model building.

Compare Different calssification algorithms¶

In [17]:
# Import necessary packages 
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
In [18]:
##Spot-Checking Algorithms

models = []

models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC()))
models.append(('RF', RandomForestClassifier()))

#testing models

results = []
names = []

for name, model in models:
    kfold = KFold(n_splits=10, random_state=42, shuffle=True)
    cv_results = cross_val_score(model, Xtrain_scaled, Ytrain, cv=kfold, scoring='roc_auc') # looking for roc auc value as metric
    results.append(cv_results)
    names.append(name)
    model_performance = '%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())
    print(model_performance)
LR: 0.814435 (0.026608)
LDA: 0.815038 (0.027111)
KNN: 0.812069 (0.019961)
CART: 0.746283 (0.035089)
SVM: 0.836117 (0.023499)
RF: 0.888111 (0.017494)
In [19]:
#Compare Algorithms

fig = plt.figure(figsize=(12,10))
plt.title('Comparison of Classification Algorithms')
plt.xlabel('Algorithm')
plt.ylabel('ROC-AUC Score')
plt.boxplot(results)
#ax = fig.add_subplot(111)
plt.show()
The highest ROC_AUC score getting is with SVM and RF algorithm So lets try these two algorithms with different parameters. Highest Scoring and low standard deviation we got for RandomForest model¶

Support Vector Machine¶

In [20]:
# Import support vector machine algorithm
from sklearn import svm
Model with help of different kernals¶
In [21]:
# kernal = rbf
model_svc = svm.SVC(kernel='rbf')
model_svc.fit(Xtrain_scaled, Ytrain)
Ypred = model_svc.predict(Xtest_scaled)
print('roc_auc_score Score:')
print(metrics.roc_auc_score(Ytest,Ypred))
roc_auc_score Score:
0.7408241476038087
In [22]:
# kernal = linear
model_svc = svm.SVC(kernel='linear')
model_svc.fit(Xtrain_scaled, Ytrain)
Ypred = model_svc.predict(Xtest_scaled)
print('roc_auc_score Score:')
print(metrics.roc_auc_score(Ytest,Ypred))
roc_auc_score Score:
0.7420291572833945
In [23]:
# kernal = poly
model_svc = svm.SVC(kernel='poly')
model_svc.fit(Xtrain_scaled, Ytrain)
Ypred = model_svc.predict(Xtest_scaled)
print('roc_auc_score Score:')
print(metrics.roc_auc_score(Ytest,Ypred))
roc_auc_score Score:
0.7292679072340089

Optimization of SVM¶

Optimization with Kernal = 'rbf' and C parameters¶
In [24]:
# cross-validation import from sklearn
from sklearn.model_selection import cross_val_score

C_range = list(range(1,30))

roc_auc_score=[]
for c in C_range:
    svc = svm.SVC(kernel='rbf',C=c)
    scores = cross_val_score(svc,Xtrain_scaled,Ytrain,cv=10,scoring='roc_auc')
    roc_auc_score.append(scores.mean())
print(roc_auc_score)
[0.8362592174134151, 0.8367765979648526, 0.8363567348887193, 0.8362332683256811, 0.8361111151091217, 0.8359614535806983, 0.8358145926586932, 0.8357905746100741, 0.8364301383631505, 0.8362333402898718, 0.8359641702288941, 0.8363826120122626, 0.8362127465373229, 0.8355264720275191, 0.8353571822661043, 0.83501396105298, 0.8349179908077741, 0.8346485388870498, 0.8341304147056425, 0.8335421794115249, 0.8331477436827435, 0.8326565640937069, 0.8325328336619698, 0.8320921909257952, 0.8321166167714944, 0.8319459896755375, 0.8313807169552432, 0.831453190892212, 0.8313786899638739]
In [25]:
plt.figure(figsize=(10,6))
C_values = list(range(1,30))
# plot C values in X-axis and cross_validate_accuracy on y-axis
plt.plot(C_values,roc_auc_score)
plt.xticks(np.arange(0,30,1))
plt.xlabel('Value of C for SVM')
plt.ylabel('Cross-Validate roc_auc_score')
Out[25]:
Text(0, 0.5, 'Cross-Validate roc_auc_score')

Highest training score getting at C=2 and Kernal = 'rbf'

In [26]:
# Lets try with gamma 
gamma_range = [0.0001,0.001,0.01,0.1,1,10,100]
roc_auc_score = []
for g in gamma_range:
    svc = svm.SVC(kernel='rbf',gamma=g)
    scores = cross_val_score(svc,Xtrain_scaled,Ytrain,cv=10,scoring='roc_auc')
    roc_auc_score.append(scores.mean())
print(roc_auc_score)
[0.8064085251179014, 0.8110681344962746, 0.8201240362795474, 0.8366541628885467, 0.8452108670725448, 0.8418234436544376, 0.7717804510475588]

Highest score is getting at gamma = 1. Lets try some more values between 1 to 10.

In [27]:
# Lets try with gamma 
gamma_range = [1,2,3,4,5,6,7,8,9,10]
roc_auc_score = []
for g in gamma_range:
    svc = svm.SVC(kernel='rbf',gamma=g)
    scores = cross_val_score(svc,Xtrain_scaled,Ytrain,cv=10,scoring='roc_auc')
    roc_auc_score.append(scores.mean())
print(roc_auc_score)
[0.8452108670725448, 0.8555459563321293, 0.8576152926303872, 0.8537298740146904, 0.8508775853135478, 0.8478968135455798, 0.8448296247787102, 0.8431356477017037, 0.8416528465435599, 0.8418234436544376]
In [28]:
plt.figure(figsize=(10,6))
C_values = list(range(1,11))
# plot C values in X-axis and cross_validate_accuracy on y-axis
plt.plot(C_values,roc_auc_score)
plt.xticks(np.arange(0,11,1))
plt.xlabel('Value of gamma for SVM')
plt.ylabel('Cross-Validate roc_auc_score')
Out[28]:
Text(0, 0.5, 'Cross-Validate roc_auc_score')

Highest training score at gamma = 3.

We will see different parameter values using Grid Search CV¶

In [29]:
#Finding best parameters for our SVC model
from sklearn.model_selection import GridSearchCV
param = {
    'C': [1,2,3,4,5],
    'kernel':['linear', 'rbf'],
    'gamma' :[1,2,3,4,5]
}
grid_svc = GridSearchCV(model_svc, param_grid=param, scoring='roc_auc', cv=10)
In [30]:
grid_svc.fit(Xtrain_scaled, Ytrain) # fit/train the model
Out[30]:
GridSearchCV(cv=10, estimator=SVC(kernel='poly'),
             param_grid={'C': [1, 2, 3, 4, 5], 'gamma': [1, 2, 3, 4, 5],
                         'kernel': ['linear', 'rbf']},
             scoring='roc_auc')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=10, estimator=SVC(kernel='poly'),
             param_grid={'C': [1, 2, 3, 4, 5], 'gamma': [1, 2, 3, 4, 5],
                         'kernel': ['linear', 'rbf']},
             scoring='roc_auc')
SVC(kernel='poly')
SVC(kernel='poly')
In [31]:
#Best parameters for our svc model
grid_svc.best_params_
Out[31]:
{'C': 1, 'gamma': 3, 'kernel': 'rbf'}
In [32]:
# create model with best parameters 
model_svc_bp = svm.SVC(kernel='rbf',C=1, gamma = 3)
model_svc_bp.fit(Xtrain_scaled, Ytrain)

from sklearn.metrics import classification_report
Ypred = model_svc_bp.predict(Xtest_scaled)
print ('*'*10 + 'Test confusion metrics'+ '*'*10)
print(classification_report(Ytest, Ypred))
**********Test confusion metrics**********
              precision    recall  f1-score   support

           0       0.79      0.57      0.66       143
           1       0.72      0.88      0.79       177

    accuracy                           0.74       320
   macro avg       0.75      0.72      0.73       320
weighted avg       0.75      0.74      0.73       320

Rnadom Forest Algorithm¶

In [33]:
# Random Forest Algorithm
rf = RandomForestClassifier()
kfold = KFold(n_splits=10, random_state=42, shuffle=True)
cv_results = cross_val_score(rf, Xtrain_scaled, Ytrain, cv=kfold, scoring='roc_auc')
msg = '%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())
print(msg)
RF: 0.889027 (0.021100)

Hyperparameter Optimization for Random Forest Model¶

In [34]:
param_dist = {'max_depth': [18,19,20,21,22,25],
              'bootstrap': [True, False],
              'max_features': ['auto', 'sqrt', 'log2', None],
              'criterion': ['gini', 'entropy']}

cv_rf = GridSearchCV(rf, cv = 10,
                     param_grid=param_dist, 
                     n_jobs = 3)

cv_rf.fit(Xtrain_scaled, Ytrain)
print('Best Parameters using grid search: \n', cv_rf.best_params_)
Best Parameters using grid search: 
 {'bootstrap': True, 'criterion': 'entropy', 'max_depth': 21, 'max_features': 'auto'}
In [35]:
# build Model using Hyperparamaters
rf = RandomForestClassifier(criterion='entropy', bootstrap = True, max_depth = 21, max_features = 'auto'  )
kfold = KFold(n_splits=10, random_state=42, shuffle=True)
cv_results = cross_val_score(rf, Xtrain_scaled, Ytrain, cv=kfold, scoring='roc_auc')
msg = '%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())
print(msg)
RF: 0.891717 (0.019372)

Now we are getting ROC_AUC score = 0.891717 which is slightly higher than any other algorithm.

In [36]:
# Set best parameters given by grid search 
rf.set_params(criterion='entropy', bootstrap = True, max_depth = 21, max_features = 'auto')
Out[36]:
RandomForestClassifier(criterion='entropy', max_depth=21, max_features='auto')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(criterion='entropy', max_depth=21, max_features='auto')

Evaluate with the Random forest model¶

In [37]:
# OOB rate 
rf.set_params(warm_start=True, 
                  oob_score=True)

min_estimators = 15
max_estimators = 1000

error_rate = {}

for i in range(min_estimators, max_estimators + 1):
    rf.set_params(n_estimators=i)
    rf.fit(Xtrain_scaled, Ytrain)

    oob_error = 1 - rf.oob_score_
    error_rate[i] = oob_error
In [38]:
# Convert dictionary to a pandas series for easy plotting 
oob_series = pd.Series(error_rate)
In [39]:
fig, ax = plt.subplots(figsize=(10, 10))

ax.set_facecolor('#fafafa')

oob_series.plot(kind='line',color = 'red')
plt.axhline(0.055, color='#875FDB',linestyle='--')
plt.axhline(0.05, color='#875FDB',linestyle='--')
plt.xlabel('n_estimators')
plt.ylabel('OOB Error Rate')
plt.title('OOB Error Rate Across various Forest sizes \n(From 15 to 1000 trees)')
Out[39]:
Text(0.5, 1.0, 'OOB Error Rate Across various Forest sizes \n(From 15 to 1000 trees)')
In [44]:
print('OOB Error rate for 200 trees is: {0:.5f}'.format(oob_series[220]))
OOB Error rate for 200 trees is: 0.17514
In [45]:
# Refine the tree via OOB Output
rf.set_params(criterion='entropy', bootstrap = True, max_depth = 21, max_features = 'auto', n_estimators=220, oob_score=False)
Out[45]:
RandomForestClassifier(criterion='entropy', max_depth=21, max_features='auto',
                       n_estimators=220, warm_start=True)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(criterion='entropy', max_depth=21, max_features='auto',
                       n_estimators=220, warm_start=True)
In [46]:
kfold = KFold(n_splits=10, random_state=42, shuffle=True)
cv_results = cross_val_score(rf, Xtrain_scaled, Ytrain, cv=kfold, scoring='roc_auc')
msg = '%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())
print(msg)
RF: 0.890050 (0.018854)
In [47]:
accuracy_rf = rf.score(Xtest_scaled, Ytest)

print("Here is our accuracy on the test set:\n {0:.3f}"\
      .format(accuracy_rf))
Here is our accuracy on the test set:
 0.812
In [48]:
# Here we calculate the test error rate!
test_error_rate_rf = 1 - accuracy_rf
print("The test error rate for our model is:\n {0: .4f}"\
      .format(test_error_rate_rf))
The test error rate for our model is:
  0.1875
In [49]:
# visualize roc curve 
from sklearn.metrics import RocCurveDisplay
disp_roc = RocCurveDisplay.from_estimator(rf, Xtest_scaled, Ytest)
plt.show()
In [50]:
#Confusion Metrics
Ypredict = rf.predict(Xtest_scaled)
cfs_metrics = confusion_matrix(Ytest, Ypredict, labels=rf.classes_)
print (cfs_metrics)
[[112  31]
 [ 29 148]]
In [51]:
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cfs_metrics, display_labels=rf.classes_)
disp.plot(cmap='GnBu')
plt.show()
In [52]:
print(metrics.classification_report(Ytest, Ypredict))
              precision    recall  f1-score   support

           0       0.79      0.78      0.79       143
           1       0.83      0.84      0.83       177

    accuracy                           0.81       320
   macro avg       0.81      0.81      0.81       320
weighted avg       0.81      0.81      0.81       320

Conclusion¶

As per objective of the project our model can predict the given wine as good or bad with accuracy of 81%. As a business requirenment we are concern about both Precision and Recall so we can consider F1-score/ROC_AUC score for model evaluation. More false positives/lower precision would lead to the unnecessary production cost and decrease in the brand value. More false negative/lower recall will not allow to produce good wine. Model based on Random Forest algorithm gave highest F1-score/ROC_AUC score in training and testing. So we can use model based on random forest algorithm for predicting the wine quality. This will help the business to take better decisions with less intervention of the experts.